Abstract: Clustering is a useful technique that organizes a large quantity of unordered text documents into a small
number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and
browsing mechanisms. Some clustering methods have to assume some cluster relationship among the
data objects that they are applied to. Similarity between a pair of objects can be defined either explicitly or implicitly. The
major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a
single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed not to
be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of
similarity can be achieved. Theoretical analysis and an empirical study are conducted to support this claim. Two criterion
functions for document clustering are proposed based on this new measure. We compare them with several well-known
clustering algorithms that use other popular similarity measures on various document collections to verify the advantages
of our proposal.
I. INTRODUCTION

Clustering in general is an important and useful technique that automatically organizes a collection with a
substantial number of data objects into a much smaller number of coherent groups [1]. The aim of clustering is to find
intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Many
clustering algorithms are published every year. They are proposed for very distinct research fields and developed
using totally different techniques and approaches. Nevertheless, according to a recent study [2], more than half a century after
it was introduced, the simple k-means algorithm still remains one of the top 10 data mining algorithms. It is the
most frequently used partitional clustering algorithm in practice. Another recent scientific discussion [3] states that k-means
is the favorite algorithm that practitioners in the related fields choose to use. k-means has more than a few basic drawbacks,
such as sensitivity to initialization and to cluster size, difficulty in comparing the quality of the clusters produced, and
performance that can be worse than other state-of-the-art algorithms in many domains. In spite of that, its simplicity,
understandability and scalability are the reasons for its tremendous popularity. While offering reasonable results, k-means is
fast and easy to combine with other methods in larger systems. A common approach to the clustering problem is to treat it as
an optimization process. An optimal partition is found by optimizing a particular function of similarity (or distance) among
data. Basically, there is an implicit assumption that the true intrinsic structure of the data can be correctly described by the
similarity formula defined and embedded in the clustering criterion function. Hence, the effectiveness of clustering algorithms
under this approach depends on the appropriateness of the similarity measure to the data at hand. For instance, the original
k-means has a sum-of-squared-error objective function that uses Euclidean distance. In a very sparse and high-dimensional
domain like text documents, spherical k-means, which uses cosine similarity instead of Euclidean distance as the measure, is
deemed to be more suitable [4], [5]. A variety of similarity or distance measures have been proposed and widely applied,
such as cosine similarity and the Jaccard correlation coefficient. Meanwhile, similarity is often conceived in terms of
dissimilarity or distance [6]. Measures such as Euclidean distance and relative entropy have been applied in clustering to
calculate the pairwise distances.

The Vector-Space Model is a popular model in the information retrieval domain [7]. In this model, each element in
the domain is taken to be a dimension in a vector space. A collection is represented by a vector, with components along
exactly those dimensions corresponding to the elements in the collection. One advantage of this model is that we can
weight the components of the vectors by using schemes such as TF-IDF [8]. The Cosine Similarity Measure (CSM) defines
the similarity of two document vectors $d_i$ and $d_j$, $\mathrm{sim}(d_i, d_j)$, as the cosine of the angle between them. For unit
vectors, this equals their inner product:

$\mathrm{sim}(d_i, d_j) = \cos(d_i, d_j) = d_i^{t} d_j \qquad (1)$
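
As a minimal illustration of Eq. (1), the following Python sketch checks that the cosine of two vectors equals the inner product of their unit-length versions. The helper names (normalize, inner, cosine) and the example vectors are assumptions made for illustration only; they are not taken from any particular library.

```python
import math


def inner(a, b):
    """Inner product of two vectors of equal dimension."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(inner(v, v))
    if length == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / length for x in v]


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))


# Hypothetical raw term-weight vectors (made-up values).
raw_i = [3.0, 0.0, 4.0]
raw_j = [1.0, 2.0, 2.0]

d_i = normalize(raw_i)
d_j = normalize(raw_j)

cos_raw = cosine(raw_i, raw_j)   # cosine computed on the raw vectors
dot_unit = inner(d_i, d_j)       # inner product of the unit vectors, Eq. (1)
```

Normalizing once and then using plain inner products is the cheaper route when many pairwise similarities are needed.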

This measure has proven to be very popular for query-document and document-document similarity in text 
retrieval. Collaborative filtering systems such as GroupLens [9] use a similar vector model, with each dimension being a 
"vote" of the user for a particular item. However, they use the Pearson Correlation Coefficient as a similarity measure, which 
first subtracts the average of the elements from each of the vectors before computing their cosine similarity. Formally, this 
similarity is given by the formula:

$\mathrm{sim}(X, Y) = \frac{\sum_j (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_j (x_j - \bar{x})^2 \sum_j (y_j - \bar{y})^2}} \qquad (2)$

where $x_j$ is the value of vector $X$ in dimension $j$, $\bar{x}$ is the average value of $X$ along a dimension, and the
summation is over all dimensions in which both $X$ and $Y$ are nonzero [9]. Inverse User Frequency may be used to weight the
different components of the vectors. There have also been other enhancements such as default voting and case amplification
[10], which modify the values of the vectors along the various dimensions. In a provocative study, Ahlgren et al. questioned
the use of Pearson's Correlation Coefficient as a similarity measure in Author Co-citation Analysis (ACA) with the
argument that this measure is sensitive to zeros. Analytically, the addition of zeros to two variables should add to their
similarity, but the authors show with empirical examples that this addition can depress the correlation coefficient between
these variables. Salton's cosine is suggested as a possible alternative because this similarity measure is insensitive to the
addition of zeros [7]. In a reaction, White defended the use of the Pearson correlation hitherto in ACA with the pragmatic
argument that the differences between using different similarity measures can be neglected in research practice. He
illustrated this with dendrograms and mappings using Ahlgren et al.'s own data. Bensman contributed to the discussion with
a letter in which he argued for using Pearson's r for additional reasons. Unlike the cosine, Pearson's r is embedded in
multivariate statistics, and because of the normalization implied, this measure allows for negative values. The problem with
the zeros can be solved by applying a logarithmic transformation to the data. In his opinion, this transformation is in any case
advisable in the case of a bivariate normal distribution. Leydesdorff and Zaal experimented with comparing the results of using
various similarity criteria (among them the cosine and the correlation coefficient) and different clustering algorithms for
the mapping. Indeed, the differences between using Pearson's r and the cosine were also minimal in their case. However,
their study was mainly triggered by concern about the use of single linkage clustering in the ISI's World Atlas of Science
[11]. The choice of this algorithm had been made by the ISI for technical reasons, given the computational limitations of that
time. The differences between using Pearson's Correlation Coefficient and Salton's cosine are marginal in practice because
the correlation measure can also be considered as a cosine between normalized vectors [12]. The normalization is sensitive
to the zeros, but as noted this can be repaired by the logarithmic transformation. More generally, however, it remains most
worrisome that one has such a wealth of both similarity criteria (e.g., Euclidean distances, the Jaccard index, etc.) and
clustering algorithms (e.g., single linkage, average linkage, Ward's method, etc.) available that one is able to generate almost
any representation from a set of data [13]. The problem of how to estimate the number of clusters, factors, groups,
dimensions, etc. is a pervasive one in multivariate analysis. In cluster analysis and multidimensional scaling, decisions
based upon visual inspection of the results are common.
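
The sensitivity to zeros discussed above can be illustrated with a small Python sketch. It computes the Pearson correlation as the cosine of mean-centered vectors and compares both measures before and after padding the vectors with zeros. Two points are assumptions of this sketch: the vectors are made-up values, and the sums run over all dimensions, a simplification that does not restrict the summation to dimensions in which both vectors are nonzero.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson(x, y):
    """Pearson correlation: cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine(centered_x, centered_y)


# Hypothetical count vectors (made-up values).
x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]

# The same vectors after two extra dimensions where both are zero.
x_padded = x + [0.0, 0.0]
y_padded = y + [0.0, 0.0]

cos_before, cos_after = cosine(x, y), cosine(x_padded, y_padded)
r_before, r_after = pearson(x, y), pearson(x_padded, y_padded)
```

The cosine is unchanged by the padding because zeros contribute nothing to the inner product or to the norms, whereas the padding shifts the means and therefore changes Pearson's r.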

The following Table 1 summarizes the basic notations that will be used extensively throughout this paper to 
represent documents and related concepts. 



Notation                          Description
n                                 number of documents
m                                 number of terms
c                                 number of classes
k                                 number of clusters
d                                 document vector, ||d|| = 1
S = {d_1, ..., d_n}               set of all the documents
S_r                               set of documents in cluster r
D = \sum_{d_i \in S} d_i          composite vector of all the documents
D_r = \sum_{d_i \in S_r} d_i      composite vector of cluster r
C = D / n                         centroid vector of all the documents
C_r = D_r / n_r                   centroid vector of cluster r, n_r = |S_r|



II. RELATED WORK 

2.1 Clustering: 

Clustering can be considered the most important unsupervised learning technique; so, as with every other problem of this
kind, it deals with finding a structure in a collection of unlabeled data. Clustering is the process of organizing objects into
groups whose members are similar in some way. A cluster is therefore a collection of objects which are similar to one another
and dissimilar to the objects belonging to other clusters. Generally, clustering is used in Data Mining, Information
Retrieval, Text Mining, Web Analysis, Marketing and Medical Diagnostics.
2.2 Document representation:

The various clustering algorithms represent each document using the well-known term frequency-inverse
document frequency (tf-idf) vector space model (Salton, 1989). In this model, each document d is considered to be a vector
in the term space and is represented by the vector

$d_{\mathrm{tfidf}} = (tf_1 \log(n/df_1), tf_2 \log(n/df_2), \ldots, tf_m \log(n/df_m)) \qquad (3)$

where $tf_i$ is the frequency of the $i$-th term (i.e., term frequency), $n$ is the total number of documents, and $df_i$ is the
number of documents that contain the $i$-th term (i.e., document frequency). To account for documents of different lengths, the
length of each document vector is normalized so that it is of unit length. In the rest of the paper, we will assume that the
vector representation for each document has been weighted using tf-idf and normalized so that it is of unit length.
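
The weighting in Eq. (3) followed by unit-length normalization can be sketched as below. This is a minimal illustration rather than a reference implementation: the function name tfidf_vectors, the whitespace tokenization, the toy corpus, and the use of the natural logarithm are all assumptions of the sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, unit-length tf-idf vectors) for tokenized docs.

    docs is a list of token lists. Each weight is tf_i * log(n / df_i),
    as in Eq. (3); the natural logarithm is an assumption of this sketch.
    """
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[term] * math.log(n / df[term]) for term in vocab]
        length = math.sqrt(sum(w * w for w in vec))
        # A document whose terms all occur in every document has zero length
        # and is left unnormalized.
        vectors.append([w / length for w in vec] if length > 0 else vec)
    return vocab, vectors


# Toy corpus (made-up documents).
docs = [
    "cluster text documents".split(),
    "cluster similarity measure".split(),
    "text similarity viewpoint".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Terms that occur in fewer documents receive larger weights: in the first toy document, "documents" (one document) outweighs "cluster" (two documents).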

2.3 Similarity measures: 

Two prominent ways have been proposed to compute the similarity between two documents $d_i$ and $d_j$. The
first method is based on the commonly used (Salton, 1989) cosine function:

$\cos(d_i, d_j) = \frac{d_i^{t} d_j}{\|d_i\| \, \|d_j\|} \qquad (4)$

Since the document vectors are of unit length, it simplifies to $d_i^{t} d_j$. The second method computes the similarity
between the documents using the Euclidean distance $\mathrm{dis}(d_i, d_j) = \|d_i - d_j\|$. Note that besides the fact that one measures
similarity and the other measures distance, these measures are quite similar to each other because the document vectors are
of unit length: in that case $\|d_i - d_j\|^2 = 2(1 - \cos(d_i, d_j))$.
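
The closeness of the two measures for unit-length vectors can be checked numerically. The sketch below verifies that the squared Euclidean distance equals 2(1 - cos) for a pair of normalized vectors, so ranking pairs by increasing distance is the same as ranking them by decreasing cosine. The helper names and example vectors are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def euclidean(a, b):
    """Euclidean distance ||a - b||."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical document vectors, normalized to unit length.
d_i = normalize([1.0, 2.0, 0.0])
d_j = normalize([2.0, 1.0, 1.0])

cos_ij = dot(d_i, d_j)           # Eq. (4) for unit vectors
dist_ij = euclidean(d_i, d_j)    # dis(d_i, d_j)

# For unit vectors: ||d_i - d_j||^2 = 2 - 2 * cos(d_i, d_j)
identity_gap = abs(dist_ij ** 2 - (2.0 - 2.0 * cos_ij))
```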

III. OPTIMIZATION ALGORITHM 

Our goal is to perform document clustering by optimizing the criterion functions $I_R$ and $I_V$ [clustering with MVSC]. To
achieve this, we utilize the sequential and incremental version of k-means [14], [15], which is guaranteed to converge to a
local optimum. This algorithm consists of a number of iterations: initially, k seeds are selected randomly and each document
is assigned to the cluster of the closest seed based on cosine similarity; in each of the subsequent iterations, the documents are
picked in random order and, for each document, a move to a new cluster takes place if such a move leads to an increase in the
objective function. In particular, considering that the expression of $I_V$ [clustering with MVSC] depends only on $n_r$ and $D_r$,
$r = 1, \ldots, k$, let us represent $I_V$ in the general form

$I_V = \sum_{r=1}^{k} I_r(n_r, D_r) \qquad (5)$

Assume that, at the beginning of some iteration, a document $d_i$ belongs to a cluster $S_p$ that has objective value
$I_p(n_p, D_p)$. $d_i$ will be moved to another cluster $S_q$ that has objective value $I_q(n_q, D_q)$ if the following condition is satisfied:

$\Delta I_V = I_p(n_p - 1, D_p - d_i) + I_q(n_q + 1, D_q + d_i) - I_p(n_p, D_p) - I_q(n_q, D_q) > 0 \qquad (6)$

$\text{s.t. } q = \arg\max_{r, r \neq p} \{ I_r(n_r + 1, D_r + d_i) - I_r(n_r, D_r) \}$

Hence, document $d_i$ is moved to the new cluster that gives the largest increase in the objective function, if such an
increase exists. The composite vectors of the corresponding old and new clusters are updated instantly after each move. If a
maximum number of iterations is reached or no more moves are detected, the procedure is stopped. A major advantage of our
clustering functions under this optimization scheme is that they are very efficient computationally. During the optimization
process, the main computational demand comes from searching for the optimum clusters to move individual documents to, and
from updating composite vectors as a result of such moves. If $T$ denotes the number of iterations the algorithm takes and $nz$ the total
number of non-zero entries in all document vectors, the computational complexity required for clustering with $I_R$ and $I_V$ is
approximately $O(nz \cdot k \cdot T)$.
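
The incremental scheme described above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation. Because the formulas of $I_R$ and $I_V$ are not given in this section, the per-cluster objective cluster_objective is a stand-in: it uses the spherical k-means term, the norm of the composite vector $D_r$. The function names, the dense list representation, the rule of never emptying a cluster, and the toy data are likewise assumptions.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def cluster_objective(n_r, D_r):
    # ASSUMPTION: stand-in for I_r(n_r, D_r). This is the spherical k-means
    # term ||D_r||, not I_R or I_V. n_r is unused but mirrors Eq. (5).
    return norm(D_r)


def incremental_clustering(docs, k, max_iter=20, seed=0):
    """Cluster unit-length vectors by single-document moves, as in Eq. (6)."""
    rng = random.Random(seed)
    dim = len(docs[0])
    # Initialization: k random seeds; each document joins the most similar one.
    seeds = rng.sample(range(len(docs)), k)
    labels = [max(range(k), key=lambda r: dot(d, docs[seeds[r]])) for d in docs]
    sizes = [0] * k
    comps = [[0.0] * dim for _ in range(k)]  # composite vectors D_r
    for d, r in zip(docs, labels):
        sizes[r] += 1
        comps[r] = add(comps[r], d)

    def gain(r, d):
        # Change in I_r if document d joins cluster r.
        joined = cluster_objective(sizes[r] + 1, add(comps[r], d))
        return joined - cluster_objective(sizes[r], comps[r])

    for _ in range(max_iter):
        moved = False
        order = list(range(len(docs)))
        rng.shuffle(order)  # documents are visited in random order
        for i in order:
            d, p = docs[i], labels[i]
            if sizes[p] == 1:
                continue  # ASSUMPTION: never empty a cluster
            left = cluster_objective(sizes[p] - 1, sub(comps[p], d))
            loss = left - cluster_objective(sizes[p], comps[p])
            q = max((r for r in range(k) if r != p), key=lambda r: gain(r, d))
            if loss + gain(q, d) > 1e-12:  # Delta I > 0, Eq. (6)
                comps[p], comps[q] = sub(comps[p], d), add(comps[q], d)
                sizes[p] -= 1
                sizes[q] += 1
                labels[i] = q
                moved = True
        if not moved:
            break  # local optimum: no single move improves the objective
    return labels


def normalize(v):
    return [x / norm(v) for x in v]


# Toy data: two groups of documents with disjoint term support.
docs = [normalize(v) for v in (
    [1.0, 0.1, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [1.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.2],
)]
labels = incremental_clustering(docs, k=2)
```

Only the two composite vectors involved in a move are updated, which is what keeps each step cheap; a sparse representation would be needed to reach the stated complexity on real document collections.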

IV. EXISTING SYSTEM 

The principal aim of clustering is to arrange data objects into separate clusters such that the intra-cluster
similarity as well as the inter-cluster dissimilarity is maximized. The problem formulation itself implies that some form of
measurement is needed to determine such similarity or dissimilarity. There are many state-of-the-art clustering approaches
that do not employ any specific form of measurement, for instance, probabilistic model-based methods [16] and non-negative
matrix factorization [17]. On the other hand, Euclidean distance is one of the most popular measures. It is used in the traditional
k-means algorithm. The objective of k-means is to minimize the Euclidean distance between objects of a cluster and that
cluster's centroid:

$\min \sum_{r=1}^{k} \sum_{d_i \in S_r} \|d_i - C_r\|^2 \qquad (7)$

However, for data in a sparse and high-dimensional space, such as that in document clustering, cosine similarity is
more widely used. It is also a popular similarity score in text mining and information retrieval [18]. The cosine measure is used
in a variant of k-means called spherical k-means [4]. While k-means aims to minimize Euclidean distance, spherical k-means
intends to maximize the cosine similarity between documents in a cluster and that cluster's centroid:

$\max \sum_{r=1}^{k} \sum_{d_i \in S_r} \frac{d_i^{t} C_r}{\|C_r\|} \qquad (8)$
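
To make the two objectives concrete, the sketch below evaluates Eq. (7) and Eq. (8) for a fixed partition of unit-length vectors. The function names and the small two-cluster example are assumptions made for illustration.

```python
import math


def centroid(cluster):
    """Centroid C_r = D_r / n_r of a non-empty list of vectors."""
    n_r = len(cluster)
    return [sum(column) / n_r for column in zip(*cluster)]


def kmeans_sse(clusters):
    """Eq. (7): sum of squared Euclidean distances to cluster centroids."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        for d in cluster:
            total += sum((x - y) ** 2 for x, y in zip(d, c))
    return total


def spherical_objective(clusters):
    """Eq. (8): sum of cosines between documents and their centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        c_norm = math.sqrt(sum(x * x for x in c))
        for d in cluster:
            total += sum(x * y for x, y in zip(d, c)) / c_norm
    return total


# Hypothetical partition of three unit vectors into two clusters.
clusters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.6, 0.8]],
]
sse = kmeans_sse(clusters)              # to be minimized
cos_sum = spherical_objective(clusters)  # to be maximized
```

For unit-length documents the inner sum in Eq. (8) reduces to the norm of the composite vector of the cluster, so the second objective can be maintained from composite vectors alone.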
The major difference between Euclidean distance and cosine similarity, and therefore between k-means and
spherical k-means, is that the former focuses on vector magnitudes, while the latter emphasizes vector directions. Besides
direct application in spherical k-means, the cosine of document vectors is also widely used in many other document clustering
methods as a core similarity measurement. The cosine similarity in Eq. (1) can be expressed in the following form without
changing its meaning:

$\mathrm{sim}(d_i, d_j) = \cos(d_i - 0, d_j - 0) = (d_i - 0)^{t} (d_j - 0) \qquad (9)$

where $0$ is the vector that represents the origin point. According to this formula, the measure takes $0$ as the one and only
reference point. The similarity between two documents $d_i$ and $d_j$ is determined with respect to the angle between the two
points when looking from the origin. To construct a new concept of similarity, it is possible to use more than just one point
of reference. We may have a more accurate assessment of how close or distant a pair of points is if we look at them from
many different viewpoints. From a third point $d_h$, the directions and distances to $d_i$ and $d_j$ are indicated respectively by the
difference vectors $(d_i - d_h)$ and $(d_j - d_h)$. By standing at various reference points $d_h$ to view $d_i$ and $d_j$, and working on their
difference vectors, we define the similarity between the two documents as:

$\mathrm{sim}_{d_i, d_j \in S_r}(d_i, d_j) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{sim}(d_i - d_h, d_j - d_h) \qquad (10)$



As described by the above equation, the similarity of two documents $d_i$ and $d_j$, given that they are in the same cluster,
is defined as the average of similarities measured relatively from the views of all other documents outside that cluster. What
is interesting is that the similarity here is defined in close relation to the clustering problem. A presumption of cluster
memberships has been made prior to the measure. The two objects to be measured must be in the same cluster, while the
points from which this measurement is established must be outside of the cluster. We call this proposal the Multi-Viewpoint
based Similarity, or MVS. Existing systems greedily pick the next frequent item set, which represents the next cluster, to
minimize the overlap between the documents that contain both the item set and some remaining item sets. In other
words, the clustering result depends on the order of picking up the item sets, which in turn depends on the greedy heuristic.
This method does not follow a sequential order of selecting clusters. Instead, we assign documents to the best cluster.
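
A small Python sketch of the multi-viewpoint similarity in Eq. (10) follows. One assumption is made loudly here: the relative similarity of two difference vectors is taken to be their inner product, by analogy with the last form in Eq. (9); the text of this section does not spell that choice out. The function name mvs and the example vectors are also illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def mvs(d_i, d_j, viewpoints):
    """Average relative similarity of d_i and d_j over the given viewpoints.

    viewpoints plays the role of S \\ S_r: the documents outside the cluster
    that contains d_i and d_j.
    ASSUMPTION: sim(d_i - d_h, d_j - d_h) is the inner product of the
    difference vectors, mirroring Eq. (9).
    """
    if not viewpoints:
        raise ValueError("at least one viewpoint outside the cluster is needed")
    total = sum(dot(sub(d_i, d_h), sub(d_j, d_h)) for d_h in viewpoints)
    return total / len(viewpoints)


# Two documents assumed to share a cluster, and one outside viewpoint.
d_i = [1.0, 0.0]
d_j = [0.8, 0.6]
outside = [[0.0, 1.0]]

similarity = mvs(d_i, d_j, outside)

# With the origin as the only viewpoint, the measure reduces to Eq. (9).
from_origin = mvs(d_i, d_j, [[0.0, 0.0]])
```

Using the origin as the single viewpoint recovers the ordinary inner product of the two documents, which is the sense in which the cosine measure is a one-viewpoint special case.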

V. PROPOSED SYSTEM 

The main work is to develop a novel hierarchical algorithm for document clustering which provides maximum
efficiency and performance. A hierarchical clustering algorithm is based on the union of the two nearest
clusters. The initial condition is realized by setting every datum as a cluster. After a number of iterations, it reaches the final
clusters wanted. The final category, probabilistic algorithms, is focused on model matching, using probabilities as
opposed to distances to decide clusters. The proposed work focuses in particular on studying and making use of the cluster overlapping
phenomenon to design cluster merging criteria. It concentrates mainly on a new way to compute the overlap rate in order to improve time
efficiency and "the veracity". Based on the hierarchical clustering method, the Expectation-Maximization (EM) algorithm
is used in the Gaussian Mixture Model to estimate the parameters, and two sub-clusters are
combined when their overlap is the largest. Here, the data set is usually modeled with a fixed (to avoid
overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to
fit the data set better. This will converge to a local optimum, so multiple runs may produce different results. In order to
obtain a hard clustering, objects are often then assigned to the Gaussian distribution they most likely belong to; for soft
clustering this is not necessary.
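
The EM fitting step mentioned above can be sketched for the simplest case, a one-dimensional Gaussian mixture. This is an illustration under stated assumptions, not the proposed system: it does not compute an overlap rate or merge sub-clusters, it takes the initial means as an argument instead of drawing them randomly (to keep the example reproducible), and the function names and data are made up.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, init_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; return parameters and hard labels."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([p_c / total for p_c in p])
        # M-step: re-estimate weights, means and variances.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            weights[c] = n_c / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var, min_var)  # guard against collapse
    # Hard clustering: assign each point to its most likely component.
    hard = [max(range(k), key=lambda c: r[c]) for r in resp]
    return weights, means, variances, hard


# Toy data with two well-separated groups (made-up values).
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances, hard = em_gmm_1d(data, init_means=[0.0, 6.0])
```

The responsibilities in resp are the soft clustering; the final arg-max step is only needed when a hard assignment is wanted.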



Fig. 1 On Gaussian-distributed data, EM works well, since it uses Gaussians for modeling clusters

Experiments on document clustering data show that this approach can improve the efficiency of clustering.

Fig. 2 System Architecture

Fig. 2 shows the basic system architecture. Given a data set satisfying the distribution of a mixture of Gaussians,
the degree of overlap between components affects the number of clusters "perceived" by a human operator or detected by a
clustering algorithm. In other words, there may be a significant difference between intuitively defined clusters and the true
clusters of the mixture.

VI. CONCLUSION 

The key contribution of this paper is the fundamental concept of a similarity measure from multiple viewpoints.
Theoretical analysis shows that the Multi-Viewpoint based Similarity measure (MVS) is potentially more suitable for text
documents than the popular cosine similarity measure. Future methods could make use of the same principle but define
alternative forms for the relative similarity, or combine the relative similarities from the different viewpoints by means
other than averaging. In the future, it would also be possible to apply the proposed criterion
functions to hierarchical clustering algorithms. It would be interesting to explore how they work on other types of sparse and
high-dimensional data.